Homework 1 Instructions

DKU Stats 101 Spring 2024 Session 4

Author

Anonymous

Published

March 25, 2024

Assignment Background

Imagine that you are an analyst at a consulting firm hired by a boat dealer who is interested in trying to make money buying undervalued boats and reselling them to the global market. Your boss asks you to write a report that identifies the key features of boats that best determine their price. She also sends you this dataset of recent boat sales on the website boattrader.com. Use this dataset to answer the questions below. 1

Assignment Instructions

  • Please leave the name as Anonymous; the homeworks are blind graded so please do not include your name.
  • Save this document as a new document (Save As…) and rename it Homework 1 Answers.qmd.
  • You can import the dataset by inserting the code boats <- read.csv("boats.csv") into your setup block. Make sure the data file is in the same folder as your homework file.
  • If I say “Interpret…” or “Explain…” that means I want at least 1-2 good quality sentences that show that you really understand the output of what R has produced. Short, incomplete sentences that fail to demonstrate you understand your output will have points deducted.
  • While the homework isn’t a formal document, it should be written as if you are a business analyst presenting this information to a boss, i.e. everything should look neat and tidy, have good labels (graphs have appropriate titles and axis labels, etc.), and be well organized.
  • If you want a chance to earn extra credit, select your best graph or table (only one per student) and post it to the Graph Contest Teams channel. I will select a few finalists and the class will vote on the best data display. Winner receives significant extra credit.
  • Delete the Assignment Background and Assignment Instructions sections of this document before submitting.
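A minimal sketch of the import step described in the instructions above. Because boats.csv is not available here, the sketch first writes a tiny stand-in file with made-up rows to a temporary folder; in the actual homework only the read.csv() line should be needed, with boats.csv sitting in the same folder as the .qmd file. The rows and the name csv_path are hypothetical.

```r
# Stand-in for boats.csv with made-up rows (assumption: for illustration only).
csv_path <- file.path(tempdir(), "boats.csv")
writeLines(
  c("id,condition,length_ft,price",
    "1,new,18,25000",
    "2,used,22,9000",
    "3,used,30,41000"),
  csv_path
)

# In the homework this would simply be: boats <- read.csv("boats.csv")
boats <- read.csv(csv_path)

# Quick checks that the import worked as expected.
str(boats)    # column names and types
nrow(boats)   # number of rows read
```

Running str() right after the import is a cheap way to confirm that the file was found and that numeric columns were not read in as text.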

Part 1: One variable analysis

Q1: What kind of dataset do we have? (5 points)

  • According to the definitions in the textbook, describe the Five W’s for this dataset.

  • Using the definitions in the textbook, describe the variable type for the following variables:

    • id
    • type
    • boatClass
    • year
    • condition
    • length_ft
    • beam_ft
    • dryWeight_lb
    • price
    • sellerId

Q2: Literature review (5 points)

Find a news article online that discusses the major features people look for when buying a boat, and include the link in this section.

Based on the article and your own personal expectations, what are some ways we might expect the data to be distributed or variables related? Make a list of at least three things we should expect or look for in the data and write a reason why we should expect it (no need to cite academic papers, just write down your reasons). Reasons should be thoughtful and at least two sentences explaining your logic for the expectation.

Q3: Describing the data (10 points)

The first step in analyzing any dataset is doing some exploratory analysis of the variables.

  • Make a histogram of price.
    • Describe it using the three features of quantitative data.
    • Does the histogram of price surprise you? Why or why not?
    • Which is a better measure of center of the histogram, mean or median?
    • Make a nice table displaying the five number summary. You can make the table using either kable or the visual editing mode of Quarto. Calculate the five number summary using the min(), quantile(), median(), and max() functions. Show your code in the document (echo: true).
    • Calculate the standard deviation using the sd() function. Interpret it - is it large or small? How does it compare to the IQR? What does this tell you about the shape of the distribution?
    • Would this histogram benefit from a transformation, in your opinion? Why or why not? If it would, please transform it appropriately, make a new histogram, and describe the transformation.
  • Make a boxplot comparing the median price across the levels of the variable condition. If you previously transformed your data, keep it transformed for this step. Interpret this graph.
    • Hint: R imports datasets by default keeping text as text (strings in R world). Remember the DataCamp method to convert the strings into factors.
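The numeric steps in this question can be sketched as follows. The prices and conditions are made-up values, not taken from boats.csv, and as.factor() is only one common way to convert strings to factors (it may or may not be the exact method shown on DataCamp). The kable call is skipped if knitr is not installed.

```r
# Made-up, right-skewed prices (assumption: illustrative only).
price <- c(5000, 8000, 12000, 15000, 20000, 30000, 45000, 90000, 250000)

# Five number summary built from the functions named in the question.
five_num <- data.frame(
  Statistic = c("Min", "Q1", "Median", "Q3", "Max"),
  Price = c(min(price),
            quantile(price, 0.25, names = FALSE),
            median(price),
            quantile(price, 0.75, names = FALSE),
            max(price))
)
print(five_num)

# Optional nicer table for the rendered document.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(five_num, caption = "Five number summary of price"))
}

# Spread: the standard deviation is pulled up by extreme values, the IQR is not.
price_sd  <- sd(price)
price_iqr <- IQR(price)

# One possible re-expression for right-skewed data: a base-10 logarithm.
log_price <- log10(price)

# Converting text to a factor so it can be used as a grouping variable.
condition <- c("new", "used", "used", "new")
condition <- as.factor(condition)
levels(condition)
```

With a long right tail, the mean sits well above the median and the standard deviation is much larger than the IQR; after the log transformation the two centers are far closer together.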

Q4: Comparing categorical variables (5 points)

  • One interesting piece of information the dealer would like to know is if there is any change in the types of boats being sold according to what type of fuel it uses. In particular, they wonder if some fuel types are becoming less popular and therefore more difficult to sell in new boats. The best way to examine this relationship is with a contingency table.
    • Hint: documentation on how to produce contingency tables (also called crosstab tables) can be found here or you can also use the tabyl() function from the package janitor - instructions here.
  • Add margins to your table. Does it change your interpretation?
  • Now convert your table into a proportions table. Does this better help explain what the data show?
  • What should you recommend to the dealer? What additional questions or areas of research do the results suggest you should look into?
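The three table steps above can be sketched in base R as follows. The data frame boats_demo and its rows are made up, and pairing fuelType with condition is an assumption about which two variables the question intends.

```r
# Made-up rows (assumption: illustrative only, not from boats.csv).
boats_demo <- data.frame(
  fuelType  = c("gas", "gas", "diesel", "gas", "electric", "diesel", "gas", "diesel"),
  condition = c("new", "used", "used", "new", "new", "used", "used", "new")
)

# Counts of each fuel type within each condition.
counts <- table(boats_demo$fuelType, boats_demo$condition,
                dnn = c("fuelType", "condition"))
print(counts)

# Same table with row and column totals added.
with_margins <- addmargins(counts)
print(with_margins)

# Column proportions: share of each fuel type among new and among used boats.
col_props <- prop.table(counts, margin = 2)
print(round(col_props, 2))
```

The margin argument controls the comparison: margin = 2 makes each column sum to 1, so the fuel mix of new boats can be compared directly with the fuel mix of used boats even when the two groups differ in size.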

Q5: Understanding and comparing distributions (5 points)

  • Using the five number summaries, calculate if length_ft has any outliers according to the rule described in the textbook for outliers in boxplots. Show your calculations. Do you believe the outliers identified are real outliers? Why or why not? Consider the purpose of your report when preparing your answer.
  • Create a boxplot of length_ft by fuelType. What can you conclude from this display? Would any of these subgroups benefit from having length_ft re-expressed? Why or why not?
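A minimal sketch of the outlier calculation, assuming the textbook rule is the usual boxplot fence of 1.5 × IQR beyond the quartiles. The lengths are made-up values, and quantile() uses R's default method, which can give slightly different quartiles than a hand calculation.

```r
# Made-up boat lengths in feet (assumption: illustrative only).
length_ft <- c(14, 16, 18, 19, 20, 21, 22, 24, 26, 60)

# Quartiles and interquartile range.
q1  <- quantile(length_ft, 0.25, names = FALSE)
q3  <- quantile(length_ft, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside them is flagged as a potential outlier.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- length_ft[length_ft < lower_fence | length_ft > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

The rule only flags values for a closer look. Whether a flagged boat is an error or simply an unusually long boat still has to be judged against the purpose of the report.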

Q6: The Normal distribution (5 points)

For boats under 20 feet, a rough rule of thumb is that the maximum weight capacity in kilograms of a boat is approximately the length in feet times the width (or beam) of the boat times 5. In a 2006 study of Americans, the average weight was approximately 70 kilos with a standard deviation of 11 kilos.

  • In formal notation, write the Normal model of human weight.
  • Select at least five boats that are under 20 feet from the database. Make a table of the boats.
  • The title of the columns of the table should be z scores ranging from -2 to +2
  • Calculate the maximum number of passengers each boat can hold if every passenger weighs the amount corresponding to each of the following z scores: (-2, -1, 0, 1, 2). Show your calculations below your table.
  • If you are a boat maker, what cutoff would you set for boat passenger capacity and why?
  • Look up three of the boats in your table on the internet and write down what the manufacturer set as the recommended number of people on the boat. What z score cutoff does it seem the boatmaker made in each of the cases?
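The arithmetic behind the table can be sketched as follows, using the mean of 70 kg and standard deviation of 11 kg given above and the rule of thumb capacity = length × beam × 5. The 16 ft by 6 ft boat is hypothetical, and rounding down to whole passengers is an assumption.

```r
# Normal model parameters from the question.
mean_kg <- 70
sd_kg   <- 11

# Passenger weight at each z score: weight = mean + z * sd.
z <- c(-2, -1, 0, 1, 2)
weight_at_z <- mean_kg + z * sd_kg

# Hypothetical boat under 20 feet (assumption: not from boats.csv).
length_ft <- 16
beam_ft   <- 6

# Rule of thumb from the question: capacity in kg = length * beam * 5.
capacity_kg <- length_ft * beam_ft * 5

# Whole passengers that fit if everyone weighs the same amount.
max_passengers <- floor(capacity_kg / weight_at_z)
names(max_passengers) <- paste0("z = ", z)
print(max_passengers)
```

For this hypothetical boat the capacity is 480 kg, so the passenger count falls from 10 at z = -2 to 5 at z = +2.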

Part 2: Two variable analysis

Q7: Relationship between variables (15 points)

  • Make a scatterplot of price as a function of length_ft. Add a linear smoother to the plot and label any points you consider to be outliers using geom_text() - the label for each outlier should print the observation’s id. If necessary, transform any variables.
    1. Do you think there is a clear pattern? Describe the association between price and length_ft.
    2. Find out the details of any outliers you have identified. Do you think the outlier(s) should be excluded from the analysis? Why or why not?
  • Make a second graph excluding any outliers you have identified
    1. What do you estimate the correlation to be, without using technology?
    2. Check the conditions for correlation
    3. Find and interpret the correlation coefficient for this relationship
    4. Interpret this graph. What is some useful information we could communicate to the client? What questions for additional investigation does this graph prompt for you?
  • Now, make a third graph of the same two variables but color it by condition. Add a linear smoother for each condition.
    1. How does this graphical display change the interpretation you developed in your answer to part 6? Why do you think the relationship is structured like this? Explain.
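The plotting and correlation steps above can be sketched as follows. The data frame boats_demo is made up, and the price cutoff used to flag an outlier is an arbitrary choice for the demo; in the homework the outliers should be chosen by inspecting the plot. The plots are only built if ggplot2 is installed.

```r
# Made-up data (assumption: illustrative only, not from boats.csv).
boats_demo <- data.frame(
  id        = 1:8,
  length_ft = c(16, 18, 20, 22, 25, 28, 30, 24),
  price     = c(12000, 15000, 21000, 26000, 34000, 41000, 48000, 400000),
  condition = c("new", "used", "used", "new", "used", "new", "used", "new")
)

# Flag the points to label (arbitrary demo cutoff).
boats_demo$is_outlier <- boats_demo$price > 200000

# Correlation with and without the flagged points.
r_all  <- cor(boats_demo$length_ft, boats_demo$price)
kept   <- subset(boats_demo, !is_outlier)
r_kept <- cor(kept$length_ft, kept$price)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # First graph: all points, linear smoother, outliers labelled by id.
  p_all <- ggplot(boats_demo, aes(x = length_ft, y = price)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    geom_text(data = subset(boats_demo, is_outlier),
              aes(label = id), vjust = -0.8) +
    labs(title = "Price by boat length", x = "Length (ft)", y = "Price")

  # Third graph: outliers removed, one color and one smoother per condition.
  p_cond <- ggplot(kept, aes(x = length_ft, y = price, color = condition)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(title = "Price by boat length and condition",
         x = "Length (ft)", y = "Price", color = "Condition")
  # Printing p_all or p_cond (or ending a chunk with the object) draws the plot.
}
```

Mapping color inside aes() is what makes geom_smooth() fit a separate line for each condition. In this made-up data a single extreme price pulls the overall correlation far below the value obtained once it is set aside.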

Q8: Putting it all together (15 points)

Drawing on the analysis conducted in the previous sections and on at least one additional investigation of your own, write three paragraphs outlining what you think are the main findings of questions 1-7 plus your additional investigation. The additional investigation can be a graph or table that analyzes a different relationship or distribution than the ones asked about in the questions above, but that you think is meaningful and important to communicate to the client. What would you recommend to your boss as to what types of boats sell for the most money? What information are we missing in this dataset that we would need to better understand boat price?

Footnotes

  1. Credit to user MEXWELL for making this data publicly available at https://www.kaggle.com/datasets/mexwell/boat-price-prediction